Infrastructure Operations Runbook

Hybrid Local LLM Deployment Blueprint

Target Hardware Mac Studio (2026 Architectural Standard) Neural Engine 16-Core Apple NPU
Processor M4 Max (16-Core CPU / 40-Core GPU) Storage Capacity 512GB High-Speed NVMe SSD
Unified Memory 64GB Unified RAM Network Baseline 10Gb Ethernet (Central Intranet Node)

This runbook establishes a highly optimized, enterprise-grade production environment for local LLM inference on Apple Silicon. By utilizing a hybrid model-serving stack—deploying upstream llama-server for foundational GGUF structures alongside Apple's mlx-lm framework—the system minimizes inference latencies while expanding architecture compatibility. A centralized LiteLLM Proxy layer handles unified routing and team usage analytics.

⚠️ CRITICAL ARCHITECTURAL BOUNDARY: The 64GB VRAM Cap

Apple Silicon allocates Unified Memory dynamically between system tasks and the GPU. For a 64GB configuration, the default system-assigned VRAM limit available to Metal is roughly 48GB. To prevent catastrophic performance degradation caused by disk swapping, the combined size of all concurrently active models across both engines must never exceed 42GB. Leave 6GB of safety margin for Key-Value (KV) cache expansion during long context execution windows.

Phase 1: Environment Orchestration & Base Setup

Execute these operations from a clean terminal instance on the Mac Studio. Ensure you are operating within a shell running native Apple Silicon architecture (arm64).

1. Install Developer Tooling & Package Manager

Install the Xcode Command Line Tools and Homebrew package manager sequentially:

# Install Apple command line tools xcode-select --install # Install Homebrew Package Manager /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)" # Evaluate Homebrew environment setup (Append to paths) echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> ~/.zprofile eval "$(/opt/homebrew/bin/brew shellenv)"

2. Establish System Directory Layout

Maintain consistent organization for binaries, environment variables, models, and analytical logs:

mkdir -p ~/local-ai/bin mkdir -p ~/local-ai/models/gguf mkdir -p ~/local-ai/models/mlx mkdir -p ~/local-ai/configs mkdir -p ~/local-ai/logs

Phase 2: Compiling Native Upstream `llama-server`

Bypass secondary wrappers to unlock bleeding-edge optimizations (such as immediate support for new architectural quants and precise context manipulation) by compiling directly from source.

cd ~/local-ai git clone --depth 1 https://github.com/ggerganov/llama.cpp.git cd llama.cpp # Compile with native Metal (Apple Silicon GPU) acceleration enabled cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release cmake --build build --config Release -j$(sysctl -n hw.ncpu) # Move production binary into internal tool path cp build/bin/llama-server ~/local-ai/bin/

Phase 3: Deploying the Apple MLX Framework Environment

The MLX engine taps into native metal processing routines optimized specifically by Apple's machine learning engineering division, providing optimal tokens-per-second metrics for native 4-bit transformer scales.

cd ~/local-ai # Establish isolated Python 3.11/3.12 operational framework python3 -m venv venv-mlx source venv-mlx/bin/activate # Install high-performance wheel environments pip install --upgrade pip setuptools wheel pip install mlx-lm litellm[proxy]

Phase 4: Configuring the LiteLLM Gateway & Proxy Route

Create the centralized gateway routing configuration file. This orchestrates token aggregation, defines distinct models, and provisions custom team tokens.

Create Unified Mapping File

Generate a structural file at ~/local-ai/configs/litellm_config.yaml containing the mapping matrix:

model_list: - model_name: production-deep-context litellm_params: model: openai/gguf-model api_base: http://127.0.0.1:8080/v1 tpm: 100000 rpm: 1000 - model_name: production-ultra-fast litellm_params: model: openai/mlx-model api_base: http://127.0.0.1:8081/v1 tpm: 200000 rpm: 2000 litellm_settings: drop_params: true set_verbose: false general_settings: database_url: "sqlite:///~/local-ai/logs/litellm_usage.db" master_key: "sk_live_mac_studio_master_init_key_2026"

Phase 5: Launch Engineering & Process Management

For sustainable multi-engine routing, both background servers must be bound to loopback nodes on explicit ports, utilizing persistent background multiplexers (tmux) to maintain continuous operations.

Runtime Operational Parameter Rule: Before starting execution threads, ensure your models do not overlap their active weights beyond the system physical VRAM limitations outlined above.

Execution Commands (Admin Infrastructure Script)

Establish automated initialization routines within separate background screens:

# 1. Start Native GGUF Engine (Context optimized to 16k window, splitting 2 parallel worker allocation slots) tmux new-session -d -s engine-gguf '~/local-ai/bin/llama-server -m ~/local-ai/models/gguf/qwen2.5-32b-instruct-q4_k_m.gguf --port 8080 --host 127.0.0.1 -c 16384 -np 2' # 2. Start MLX Engine (High speed execution thread running Apple-native quant arrays) tmux new-session -d -s engine-mlx 'source ~/local-ai/venv-mlx/bin/activate && python3 -m mlx_lm.server --model mlx-community/Qwen2.5-14B-Instruct-4bit --port 8081 --host 127.0.0.1' # 3. Start LiteLLM Gateway Router (Exposes unified API endpoint out to the entire local network intranet) tmux new-session -d -s gateway-proxy 'source ~/local-ai/venv-mlx/bin/activate && litellm --config ~/local-ai/configs/litellm_config.yaml --port 4000 --host 0.0.0.0'

Your team members will now point all applications, IDE extensions (Cursor/VS Code), or standard UI web modules directly to the unified address: http://[MAC-STUDIO-INTERNAL-IP]:4000/v1

Phase 6: Long-Term Admin Operations & Maintenance

This section outlines routine management workflows, optimized for delegation to a junior engineer.

1. Sourcing and Adding New Models

2. Generating Virtual API Keys for Team Analytics

To provision specific API keys for tracking usage metrics across separate teams or individual developers, issue an authenticated request directly to the running LiteLLM database module:

curl -X POST "http://localhost:4000/key/generate" -H "Authorization: Bearer sk_live_mac_studio_master_init_key_2026" -H "Content-Type: application/json" -d '{"models": ["production-deep-context", "production-fast-chat"], "max_budget": 50.0, "user_id": "junior_dev_team_alpha"}'

3. Software Update Cadence & Maintenance Routines

Perform these system performance maintenance reviews every 30 days during off-peak hours:

# Update llama.cpp compile builds to absorb upstream speed increases cd ~/local-ai/llama.cpp && git pull cmake --build build --config Release -j$(sysctl -n hw.ncpu) && cp build/bin/llama-server ~/local-ai/bin/ # Update Python MLX frameworks source ~/local-ai/venv-mlx/bin/activate pip install --upgrade mlx-lm litellm
Pro-Tip: Monitoring VRAM and Thermals
Run sudo powermetrics --samplers cpu_power,gpu_power from the host terminal to inspect real-time watt draw and structural execution bounds of the hardware. Keep an eye on swapping metrics using vm_stat to ensure memory buffers remain perfectly inside the physical 64GB boundary.